Search Engine Robots - How They Work, What They Do (Part I)
by Daria Goetsch


Automated search engine robots, sometimes called "spiders" or 
"crawlers", are the seekers of web pages. How do they work? 
What is it they really do? Why are they important?

You'd think, with all the fuss about indexing web pages to add 
to search engine databases, that robots would be great and 
powerful beings. Wrong. Search engine robots have only basic 
functionality, comparable to that of early browsers in terms of 
what they can understand in a web page. Like early browsers, 
robots just can't do certain things. Robots don't understand 
frames, Flash movies, images or JavaScript. They can't enter 
password-protected areas and they can't click all those buttons 
you have on your website. They can be stopped cold while 
indexing a dynamically generated URL and slowed to a stop by 
JavaScript navigation.

How Do Search Engine Robots Work?

Think of search engine robots as automated data retrieval 
programs, traveling the web to find information and links. 

When you submit a web page to a search engine at the "Submit a 
URL" page, the new URL is added to the robot's queue of websites 
to visit on its next foray out onto the web. Even if you don't 
directly submit a page, many robots will find your site because 
of links from other sites that point back to yours. This is one 
of the reasons why it is important to build your link popularity 
and to get links from other topical sites back to yours.

When arriving at your website, the automated robots first check 
to see if you have a robots.txt file. This file tells robots 
which areas of your site are off-limits to them. Typically these 
are directories containing only binaries or other files the 
robot doesn't need to concern itself with. 
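
As a quick illustration of the convention, the sketch below uses 
Python's standard urllib.robotparser module to check a few paths 
against rules equivalent to a small robots.txt file; the 
directory names and the "ExampleBot" user agent are made up for 
the example.

    from urllib import robotparser

    # Rules equivalent to a small robots.txt file; the directory
    # names and robot name below are illustrative only.
    rules = [
        "User-agent: *",
        "Disallow: /cgi-bin/",
        "Disallow: /private/",
    ]

    rp = robotparser.RobotFileParser()
    rp.parse(rules)

    # Ask whether a given user agent may fetch a given path.
    print(rp.can_fetch("ExampleBot", "/index.html"))         # True
    print(rp.can_fetch("ExampleBot", "/private/page.html"))  # False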

Robots collect links from each page they visit and later follow 
those links through to other pages. In this way, they hop from 
one page to another. The entire World Wide Web is made up of 
links, the original idea being that you could follow links from 
one place to another. This is how robots get around. 
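
To make the idea concrete, here is a deliberately simplified 
sketch of a link-following robot written with only Python's 
standard library. The starting URL and page limit are 
placeholders, and a real robot would add politeness delays, 
robots.txt checks and far smarter URL filtering.

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkCollector(HTMLParser):
        """Collect the href of every <a> tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start_url, max_pages=10):
        queue = deque([start_url])   # pages waiting to be visited
        seen = set()                 # pages already visited
        while queue and len(seen) < max_pages:
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)
            try:
                page = urlopen(url, timeout=10)
                html = page.read().decode("utf-8", "replace")
            except (OSError, ValueError):
                continue             # skip pages that can't be fetched
            parser = LinkCollector()
            parser.feed(html)
            for link in parser.links:
                queue.append(urljoin(url, link))  # resolve relative links
        return seen

    # crawl("http://www.example.com/")  # hypothetical starting point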

The "smarts" about indexing pages online comes from the search 
engine engineers, who devise the methods used to evaluate the 
information the search engine robots retrieve. When introduced 
into the search engine database, the information is available 
for searchers querying the search engine. When a search engine 
user enters their query into the search engine, there are a 
number of quick calculations done to make sure that the search 
engine presents just the right set of results to give their 
visitor the most relevant response to their query.
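
As a toy illustration of that query-time step (and nothing like 
any engine's actual algorithm), you could imagine scoring each 
indexed page by how often the query words appear in it:

    # Rank pages by how many times the query terms occur in them.
    # Real engines use far more sophisticated, and secret, methods.
    def score(page_text, query):
        words = page_text.lower().split()
        return sum(words.count(term) for term in query.lower().split())

    index = {
        "page1.html": "search engine robots crawl the web for pages",
        "page2.html": "recipes for apple pie and other desserts",
    }

    query = "search engine robots"
    ranked = sorted(index, key=lambda url: score(index[url], query),
                    reverse=True)
    print(ranked)  # ['page1.html', 'page2.html']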

You can see which pages on your site the search engine robots 
have visited by looking at your server logs or the results from 
your log statistics program. Identifying the robots will show 
you when they visited your website, which pages they visited and 
how often they visit. Some robots are readily identifiable by 
their user agent names, like Google's "Googlebot"; others are 
a bit more obscure, like Inktomi's "Slurp". Still other robots 
listed in your logs may be ones you cannot readily identify; 
some of them may even appear to be human-powered browsers. 

Along with identifying individual robots and counting the number 
of their visits, the statistics can also show you aggressive
bandwidth-grabbing robots or robots you may not want visiting 
your website. In the resources section at the end of this 
article, you will find sites that list names and IP addresses 
of search engine robots to help you identify them.
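
If you want to do this kind of robot-spotting yourself, a rough 
sketch along the lines below will tally visits by user agent 
name. It assumes an Apache-style "combined" format log file 
named access.log and a hand-picked list of robot names, both of 
which you would adjust for your own server.

    import re
    from collections import Counter

    ROBOT_NAMES = ["Googlebot", "Slurp", "msnbot"]  # extend as needed

    # combined format: ... "request" status bytes "referer" "user-agent"
    LINE = re.compile(r'"[^"]*" \d+ \S+ "[^"]*" "(?P<agent>[^"]*)"$')

    visits = Counter()
    with open("access.log") as log:      # hypothetical log file
        for line in log:
            match = LINE.search(line.strip())
            if not match:
                continue
            agent = match.group("agent")
            for name in ROBOT_NAMES:
                if name.lower() in agent.lower():
                    visits[name] += 1

    for name, count in visits.most_common():
        print(name, count, "visits")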

How Do They Read The Pages On Your Website?

When the search engine robot visits your page, it looks at the 
visible text on the page, the content of the various tags in 
your page's source code (title tag, meta tags, etc.), and the 
hyperlinks on your page. From the words and the links that the 
robot finds, the search engine decides what your page is about. 
There are many factors used to figure out what "matters", and 
each search engine has its own algorithm for evaluating and 
processing the information. Depending on how the robot is set 
up by the search engine, the information is indexed and then 
delivered to the search engine's database. 
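
A simplified "robot's-eye view" of that first pass might look 
like the sketch below, built on Python's standard HTML parser. 
The sample markup is generic HTML, and the class is only an 
illustration of the idea, not any engine's indexing code.

    from html.parser import HTMLParser

    class PageReader(HTMLParser):
        """Collect the title, meta tags, visible text and links."""
        def __init__(self):
            super().__init__()
            self.title = ""
            self.meta = {}
            self.links = []
            self.text = []
            self._in_title = False

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "title":
                self._in_title = True
            elif tag == "meta" and "name" in attrs:
                self.meta[attrs["name"].lower()] = attrs.get("content", "")
            elif tag == "a" and attrs.get("href"):
                self.links.append(attrs["href"])

        def handle_endtag(self, tag):
            if tag == "title":
                self._in_title = False

        def handle_data(self, data):
            if self._in_title:
                self.title += data
            elif data.strip():
                self.text.append(data.strip())

    reader = PageReader()
    reader.feed("""<html><head><title>Example Page</title>
    <meta name="description" content="A sample page about robots">
    </head><body><p>Visible text here.</p>
    <a href="/about.html">About</a></body></html>""")
    print(reader.title, reader.meta, reader.links, reader.text)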

The information delivered to the databases then becomes part of 
the search engine and directory ranking process. When the search 
engine visitor submits their query, the search engine digs 
through its database to give the final listing that is displayed 
on the results page.

The search engine databases update at varying times. Once you 
are in the search engine databases, the robots keep visiting you 
periodically to pick up any changes to your pages and to make 
sure they have the latest information. How often you are visited 
depends on how each search engine schedules its visits, which 
can vary considerably from engine to engine. 

Sometimes robots are unable to access the website they are 
trying to visit. If your site is down, or is experiencing huge 
amounts of traffic, the robot may not be able to reach it. When 
this happens, the website may not be re-indexed, depending on 
how frequently the robot visits your website. In most cases, 
robots that cannot access your pages will try again later, 
hoping that your site will be accessible then.
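
In code terms, that "try again later" behavior might look 
something like the sketch below; the URL, retry count and delay 
are placeholder values.

    import time
    from urllib.request import urlopen

    def fetch_with_retries(url, attempts=3, delay=60):
        """Fetch a page, backing off and retrying if it fails."""
        for attempt in range(attempts):
            try:
                return urlopen(url, timeout=10).read()
            except OSError:
                if attempt < attempts - 1:
                    time.sleep(delay)   # wait before trying again
        return None  # give up; pick the page up on a later visit

    # page = fetch_with_retries("http://www.example.com/")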

Resources

SpiderSpotting - Search Engine Watch 
http://searchenginewatch.com/webmasters/spiders.html

Robotstxt.org 
List of robots and protocols for setting up a robots.txt file. 
http://www.robotstxt.org/

Spider-Food 
Tutorials, forums and articles about Search Engine spiders and 
Search Engine Marketing. 
http://spider-food.net/

Spiderhunter.com 
Articles and resources about tracking Search Engine spiders. 
http://www.spiderhunter.com/

Sim Spider Search Engine Robot Simulator 
Search Engine World has a spider that simulates what the Search 
Engine robots read from your website. 
http://www.searchengineworld.com/cgi-bin/sim_spider.cgi 


================================================================
Daria Goetsch is the founder and Search Engine Marketing 
Consultant for Search Innovation Marketing 
(www.searchinnovation.com), a Search Engine Promotion company 
serving small businesses. Besides running her own company, Daria 
is an associate of WebMama.com, an Internet web marketing 
strategies company. She has specialized in search engine 
optimization since 1998, including three years as the Search 
Engine Specialist for O'Reilly & Associates, a technical book 
publishing company.
================================================================